Stochastic Shortest Path Games and Q-Learning
Abstract
We consider a class of two-player zero-sum stochastic games with finite state and compact control spaces, which we call stochastic shortest path (SSP) games. They are total cost stochastic dynamic games that have a cost-free termination state. Based on their close connection to single-player SSP problems, we introduce model conditions that characterize a general subclass of these games with strong properties: the value function exists and is the unique solution of the Bellman equation, both players have optimal policies that are stationary and deterministic, and the value iteration algorithm, as well as the policy iteration algorithm started from certain well-behaved policies, converges. We then consider the classical Q-learning algorithm for computing the value function of finite state and control SSP games that satisfy our model conditions. Q-learning is a model-free, asynchronous stochastic iterative algorithm, and by the theory of stochastic approximation involving monotone nonexpansive mappings, it is known to converge when the Bellman equation has a unique solution and its iterates are bounded with probability one. We prove the boundedness of the Q-learning iterates and thereby fully establish the convergence of Q-learning for our broad class of SSP game models.

December 2011
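To make the iteration concrete, the following is a minimal sketch of tabular minimax Q-learning for a finite SSP game, assuming hypothetical components: a simulator `sample(s, u, v)` for the transition law, a one-stage cost `cost(s, u, v)`, and state 0 as the absorbing, cost-free termination state. For brevity it uses the pure-strategy minimax of each matrix game; in general the one-stage games must be solved over mixed strategies.

```python
import numpy as np

def minimax_q_learning(n_states, n_u, n_v, sample, cost,
                       n_iters=100_000, seed=0):
    """Tabular minimax Q-learning sketch for a finite SSP game.

    State 0 is the absorbing, cost-free termination state; `sample(s, u, v)`
    simulates a transition and `cost(s, u, v)` is the one-stage cost.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_u, n_v))      # Q[s] is a |U| x |V| matrix game
    visits = np.zeros((n_states, n_u, n_v), dtype=int)
    for _ in range(n_iters):
        # Totally asynchronous: update one (state, action, action) triple.
        s = rng.integers(1, n_states)
        u, v = rng.integers(n_u), rng.integers(n_v)
        s_next = sample(s, u, v)
        visits[s, u, v] += 1
        step = 1.0 / visits[s, u, v]        # diminishing stepsize per triple
        # Pure-strategy upper value of the next state's matrix game, used
        # here for brevity; in general the one-stage game is solved over
        # mixed strategies (a small linear program).
        val_next = 0.0 if s_next == 0 else Q[s_next].max(axis=1).min()
        target = cost(s, u, v) + val_next
        Q[s, u, v] += step * (target - Q[s, u, v])
    return Q
```

With the mixed-strategy game value in place of the pure-strategy minimax, this is the kind of recursion whose convergence the paper establishes once the iterates are shown to be bounded.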
Similar Resources
Reinforcement Learning for Average Reward Zero-Sum Games
We consider Reinforcement Learning for average reward zero-sum stochastic games. We present and analyze two algorithms. The first is based on relative Q-learning and the second on Q-learning for stochastic shortest path games. Convergence is proved using the ODE (Ordinary Differential Equation) method. We further discuss the case where not all the actions are played by the opponent with comparab...
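As a rough illustration of the first algorithm's flavor, here is a hedged sketch of relative (RVI-style) Q-learning adapted to a finite zero-sum average-reward game; it is not necessarily the paper's exact recursion, and the simulator `sample(s, u, v)`, reward `reward(s, u, v)`, and reference triple are hypothetical placeholders. The one-stage games are solved over pure strategies only for brevity.

```python
import numpy as np

def relative_minimax_q(n_states, n_u, n_v, sample, reward,
                       n_iters=200_000, seed=0):
    """Relative (RVI-style) Q-learning sketch for a zero-sum average-reward game."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_u, n_v))
    ref = (0, 0, 0)                 # fixed reference (state, u, v) triple
    for k in range(1, n_iters + 1):
        s = rng.integers(n_states)
        u, v = rng.integers(n_u), rng.integers(n_v)
        s_next = sample(s, u, v)
        step = 1.0 / k**0.7         # diminishing stepsize
        # Pure-strategy lower value, with the maximizer choosing u and the
        # minimizer choosing v; mixed strategies are needed in general.
        val_next = Q[s_next].min(axis=1).max()
        # Subtracting Q at the reference triple pins down the additive
        # offset that is otherwise undetermined in average-reward problems.
        target = reward(s, u, v) + val_next - Q[ref]
        Q[s, u, v] += step * (target - Q[s, u, v])
    return Q   # Q[ref] tracks an estimate of the game's average reward
```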
An Online Convergent Q-learning Algorithm with Linear Function Approximation
We present in this article a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm is convergent. Numerical results on a multi-stage stochastic shortest path problem show t...
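The following is a generic sketch of the two-timescale pattern such algorithms build on, not the article's exact recursion: a fast iterate `w` fits Q-values by semi-gradient updates while a slow iterate `theta`, which defines the greedy policy, tracks `w` with a much smaller stepsize. The feature map `phi(s, a)` and the simulator callbacks `reset_env` and `step_env` are hypothetical placeholders.

```python
import numpy as np

def two_timescale_q(phi, reset_env, step_env, n_actions, dim,
                    n_iters=200_000, seed=0):
    """Generic two-timescale Q-learning sketch with linear features."""
    w = np.zeros(dim)       # fast iterate: fits Q-values for the slow policy
    theta = np.zeros(dim)   # slow iterate: defines the greedy policy
    s = reset_env()         # hypothetical: returns an initial state
    for k in range(1, n_iters + 1):
        a_k = 1.0 / k**0.6              # fast stepsize
        b_k = 1.0 / k                   # slow stepsize: b_k / a_k -> 0
        a = min(range(n_actions), key=lambda u: phi(s, u) @ theta)
        s_next, cost, done = step_env(s, a)
        # Linear approximation: Q(s, a) ~ phi(s, a) @ w.
        q_next = 0.0 if done else min(phi(s_next, u) @ theta
                                      for u in range(n_actions))
        delta = cost + q_next - phi(s, a) @ w
        w += a_k * delta * phi(s, a)    # fast: semi-gradient TD update
        theta += b_k * (w - theta)      # slow: policy parameter tracks w
        s = reset_env() if done else s_next
    return theta
```

Keeping the policy on the slow timescale is what sidesteps the off-policy difficulty mentioned above: from the fast iterate's perspective, the policy it evaluates is effectively frozen.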
Q-learning and policy iteration algorithms for stochastic shortest path problems
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1...
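To illustrate the general flavor of such hybrids (and not the authors' exact algorithm), here is a hedged sketch of an optimistic policy iteration variant of Q-learning for a finite single-player SSP problem: the update targets bootstrap with the current policy's Q-value, and the policy is improved only periodically. The simulator `sample(s, u)` is a hypothetical placeholder, and state 0 is the cost-free termination state.

```python
import numpy as np

def optimistic_pi_q_learning(n_states, n_actions, sample,
                             n_iters=100_000, policy_period=500, seed=0):
    """Sketch of a Q-learning / policy iteration hybrid for a finite SSP problem."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions), dtype=int)
    mu = np.zeros(n_states, dtype=int)      # current policy
    for k in range(n_iters):
        if k % policy_period == 0:
            mu = Q.argmin(axis=1)           # periodic policy improvement
        s = rng.integers(1, n_states)       # asynchronous: one pair per step
        u = rng.integers(n_actions)
        s_next, cost = sample(s, u)
        visits[s, u] += 1
        step = 1.0 / visits[s, u]
        # Policy evaluation flavor: the target bootstraps with the current
        # policy's Q-value rather than the full minimization over actions.
        target = cost + (0.0 if s_next == 0 else Q[s_next, mu[s_next]])
        Q[s, u] += step * (target - Q[s, u])
    return Q, Q.argmin(axis=1)
```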
On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems
We consider a totally asynchronous stochastic approximation algorithm, Q-learning, for solving finite-space stochastic shortest path (SSP) problems, which are total cost Markov decision processes with an absorbing and cost-free state. For the most commonly used SSP models, existing convergence proofs assume that the sequence of Q-learning iterates is bounded with probability one, or some other ...
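For reference, here is a minimal sketch of the totally asynchronous Q-learning recursion for a finite single-player SSP problem, assuming a hypothetical simulator `sample(s, u)` returning a (next state, cost) pair, with state 0 absorbing and cost-free. Convergence of this recursion hinges on the iterates remaining bounded with probability one, which is exactly the property studied above.

```python
import numpy as np

def ssp_q_learning(n_states, n_actions, sample, n_iters=100_000, seed=0):
    """Classical asynchronous Q-learning sketch for a finite SSP problem."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions), dtype=int)
    for _ in range(n_iters):
        # Totally asynchronous: only one (state, action) pair per iteration.
        s = rng.integers(1, n_states)
        u = rng.integers(n_actions)
        s_next, cost = sample(s, u)
        visits[s, u] += 1
        step = 1.0 / visits[s, u]       # diminishing stepsize per pair
        target = cost + (0.0 if s_next == 0 else Q[s_next].min())
        Q[s, u] += step * (target - Q[s, u])
    return Q
```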